A Graphical Exploration of the IkeNet E-mail Dataset


Martin Short, Andrea Bertozzi, and Kate Coronges (West Point)

UCLA Applied Math


March 15, 2012


What is the IkeNet e-mail dataset?


► Certain cadets at West Point are given Blackberries in exchange for
their willingness to have data about their communication activity
logged and studied.

► We have a database of the e-mail communications within a network
of 22 such students over ≈ 1 year, from May 2010 to May 2011.

► Note that only the e-mails sent within the network are included, not
all e-mails sent by each subject.

► There are ≈ 8500 such e-mails, and each includes three pieces of
information: sender, receiver, and timestamp.

► Today I will show you several plots made from this data, to hopefully
elicit ideas about further avenues of exploration.

First, the network of e-mail traffic.


Figure 1: (Left) Dots represent the 22 subjects, and a line connects two dots if
there is at least 1 correspondence between the two in our dataset. There is only 1
component, but it is not fully connected. (Right) A plot showing the number of
e-mails sent from subject i (row) to subject j (column). Note this is a directed
graph, and the matrix is not symmetric.
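
As a minimal sketch of how the two panels of Figure 1 could be computed, the snippet below builds the directed count matrix from (sender, receiver, timestamp) records and counts connected components of the undirected version. The record layout, the 0-based subject labels, and the function names are assumptions made for illustration; the toy records are not IkeNet data.

```python
def count_matrix(emails, n):
    """Return an n x n list of lists M with M[i][j] = e-mails sent from i to j."""
    M = [[0] * n for _ in range(n)]
    for sender, receiver, _timestamp in emails:
        M[sender][receiver] += 1
    return M


def num_components(M):
    """Count connected components, linking i and j if mail went in either direction."""
    n = len(M)
    seen = set()
    components = 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first search from `start`
            i = stack.pop()
            for j in range(n):
                if j not in seen and (M[i][j] > 0 or M[j][i] > 0):
                    seen.add(j)
                    stack.append(j)
    return components


# Toy records (sender, receiver, time in hours); subject 3 never appears.
toy = [(0, 1, 0.5), (0, 1, 2.0), (1, 0, 2.1), (2, 1, 30.0)]
M = count_matrix(toy, 4)
print(M[0][1], M[1][0])    # the matrix is not symmetric
print(num_components(M))   # subject 3 forms its own component
```

On the real data one would expect `num_components` to return 1, matching the caption.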

A histogram of e-mails per pair.


Figure 2: This is data from a symmetric version of the graph, and shows
frequency of correspondences per pair. There are 5 pairs not shown here, each
with many more correspondences. For example, pair (9, 18) has 1032 messages.

Now, if we threshold the graph at 20 e-mails, we see some more detail.


Figure 3: (Left) Here, we see that subject 13 begins to stand out as a central
figure with by far the most “significant” connections. Subjects 20 and 21 are no
longer a part of the network at all.
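
The symmetrizing and thresholding steps behind Figures 2 and 3 can be sketched as follows. This assumes the symmetric weight of a pair is the sum of the two directed counts and that "threshold at 20" means keeping pairs with at least 20 messages; the slides do not state either convention, and the function names and toy matrix are illustrative only.

```python
def pair_totals(M):
    """Symmetrize: total messages exchanged by each unordered pair (i < j)."""
    n = len(M)
    totals = {}
    for i in range(n):
        for j in range(i + 1, n):
            total = M[i][j] + M[j][i]  # assumed convention: sum both directions
            if total > 0:
                totals[(i, j)] = total
    return totals


def thresholded_degrees(totals, n, threshold=20):
    """Degree of each subject, counting only pairs with >= threshold messages."""
    degree = [0] * n
    for (i, j), total in totals.items():
        if total >= threshold:  # assumed inclusive threshold
            degree[i] += 1
            degree[j] += 1
    return degree


# Toy directed count matrix for three subjects (not IkeNet data).
M = [[0, 15, 30],
     [10, 0, 2],
     [5, 1, 0]]
totals = pair_totals(M)
print(totals)
print(thresholded_degrees(totals, 3))
```

A subject whose thresholded degree is 0 drops out of the network, which is how subjects 20 and 21 would disappear in Figure 3.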

Now some temporal properties.


Figure 4: (Left) Histogram of e-mails sent per day in the IkeNet dataset. Note
the large bar at < 4 e-mails - weekends? (Right) For fun, a similar plot from
Stephen Wolfram, using his sent e-mails since 1989 (!). He is clearly more e-mail
happy than the West Point cadets, but the general shape (omitting the origin) is
similar...
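
A histogram like the left panel of Figure 4 needs the number of e-mails on every calendar day, including days with none. Here is a small sketch, assuming the timestamps are available as Python `datetime` objects; the function names and the bin width are illustrative choices, not taken from the slides.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def emails_per_day(timestamps, first_day, last_day):
    """Return {date: count} for every day in [first_day, last_day], zeros included."""
    counts = Counter(ts.date() for ts in timestamps)
    per_day = {}
    day = first_day
    while day <= last_day:
        per_day[day] = counts.get(day, 0)
        day += timedelta(days=1)
    return per_day


def histogram(values, bin_width):
    """Bin non-negative integers into [0, w), [w, 2w), ...; return {bin_start: freq}."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Toy timestamps (not IkeNet data).
stamps = [datetime(2010, 5, 3, 9, 0),
          datetime(2010, 5, 3, 10, 0),
          datetime(2010, 5, 5, 12, 0)]
per_day = emails_per_day(stamps, date(2010, 5, 3), date(2010, 5, 6))
print(histogram(per_day.values(), bin_width=1))
```

Counting the zero-mail days matters here, because they contribute to the large bar near the origin.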

But when are the e-mails sent?


Figure 5: (Left) A histogram of when IkeNet e-mails were sent during the day.
We clearly see a diurnal rhythm, modulated by lunch and dinner effects. (Right)
Histogram of e-mails per weekday, clearly dropping off on the weekends.
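
Both panels of Figure 5 come down to counting timestamps by hour of day and by day of week. A minimal sketch, again assuming `datetime` timestamps in local time (the function name is illustrative):

```python
from datetime import datetime


def hour_and_weekday_counts(timestamps):
    """Return (by_hour, by_weekday): 24 hourly counts and 7 weekday counts."""
    by_hour = [0] * 24
    by_weekday = [0] * 7  # datetime.weekday(): Monday = 0, ..., Sunday = 6
    for ts in timestamps:
        by_hour[ts.hour] += 1
        by_weekday[ts.weekday()] += 1
    return by_hour, by_weekday


# Toy timestamps (not IkeNet data): a Monday morning and a Saturday afternoon.
stamps = [datetime(2010, 5, 3, 9, 30), datetime(2010, 5, 8, 14, 0)]
by_hour, by_weekday = hour_and_weekday_counts(stamps)
print(by_hour[9], by_hour[14])
print(by_weekday)
```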

We can see these cycles clearly in an auto-correlation analysis of the time series.


Figure 6: But, also note the large spike near the origin, indicating large
correlation at very short timescales - self-excitation?
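
One common way to produce a plot like Figure 6 is the sample autocorrelation of a binned count series, for example e-mails per hour. The slides do not say which estimator or bin size was used, so the following is only a sketch under that assumption, with a toy series that has a strict 24-hour cycle.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of a non-constant series at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev)  # lag-0 sum of squares, used to normalize
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]


# Toy hourly counts: one e-mail at noon every day for 10 days (not IkeNet data).
hourly = [1 if hour % 24 == 12 else 0 for hour in range(240)]
acf = autocorrelation(hourly, max_lag=48)
print(acf[0])            # 1.0 by construction
print(acf[24] > acf[12]) # the daily cycle shows up as a peak at lag 24
```

In this toy series the peak at lag 24 reflects the daily rhythm only; a spike at very short lags, as in Figure 6, would need clustered events in addition to the cycle.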

We can check for self-excitation using a “fixed-window” count.


Figure 7: Here, we find all occurrences of subject pairs that exchanged exactly
two e-mails on a calendar day (802 of these), then plot the frequency of time
intervals between the two e-mails. The observations are vastly larger than chance
at times less than around 1 hour. 311 are separated by less than 15 minutes.
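
The fixed-window count can be sketched as below. Two assumptions are made that the slides do not confirm: a "pair" is treated as unordered (either subject may be the sender), and the chance baseline is two e-mails placed independently and uniformly within a 24-hour day, for which P(gap < x) = 1 - (1 - x/24)^2. All names are illustrative.

```python
from collections import defaultdict
from datetime import datetime


def two_email_gaps(emails):
    """Gaps in minutes for every (unordered pair, calendar day) with exactly 2 e-mails."""
    by_pair_day = defaultdict(list)
    for sender, receiver, ts in emails:
        pair = (min(sender, receiver), max(sender, receiver))  # assumed unordered
        by_pair_day[(pair, ts.date())].append(ts)
    gaps = []
    for stamps in by_pair_day.values():
        if len(stamps) == 2:
            first, second = sorted(stamps)
            gaps.append((second - first).total_seconds() / 60.0)
    return gaps


def uniform_chance_fraction(gap_hours, window_hours=24.0):
    """P(gap < gap_hours) for two independent uniform times in the window (assumed baseline)."""
    return 1.0 - (1.0 - gap_hours / window_hours) ** 2


# Toy records (not IkeNet data): one qualifying pair-day, one day with 3 e-mails.
toy = [(9, 18, datetime(2010, 5, 3, 9, 0)),
       (18, 9, datetime(2010, 5, 3, 9, 10)),
       (1, 2, datetime(2010, 5, 4, 8, 0)),
       (1, 2, datetime(2010, 5, 4, 9, 0)),
       (2, 1, datetime(2010, 5, 4, 10, 0))]
print(two_email_gaps(toy))

# Expected count under 15 minutes among 802 pair-days, under the uniform baseline.
print(802 * uniform_chance_fraction(0.25))
```

Under this uniform baseline, roughly 17 of the 802 pair-days would have gaps under 15 minutes, far fewer than the 311 observed. A baseline that accounts for the diurnal rhythm would give a somewhat larger expected count.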

We might try fitting a time series to a Hawkes process.


► Let’s use the time series from pair (9, 18), which is most prolific at
1032 messages.

► Fit the data to a process of the form

$$\lambda(t) = \mu + k \sum_{t_j < t} \omega e^{-\omega (t - t_j)}$$

using Maximum Likelihood Estimation.

► The best fit parameters are: $\mu = 0.054$ per hour, $k = 0.585$, and
$\omega^{-1} = 0.099$ hours, with a log-likelihood of -1303.6.

► These parameters tell us that there were around 428 background
events for this pair, and 604 excited events. That's a lot of
excitation...
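
For reference, here is a sketch of the log-likelihood that such a fit would maximize, assuming the exponential-kernel form λ(t) = μ + k Σ ω exp(-ω (t - t_j)) on an observation window [0, T], with event times in hours. The function name and the recursive evaluation are illustrative choices, not necessarily the authors' implementation; an optimizer over (μ, k, ω) would be layered on top.

```python
import math


def hawkes_loglik(times, T, mu, k, omega):
    """Log-likelihood of sorted event times on [0, T] for the assumed model
    lambda(t) = mu + k * sum_{t_j < t} omega * exp(-omega * (t - t_j))."""
    loglik = -mu * T  # integral of the background rate over [0, T]
    A = 0.0           # running sum of exp(-omega * (t_i - t_j)) over earlier events
    prev = None
    for t in times:
        if prev is not None:
            A = math.exp(-omega * (t - prev)) * (1.0 + A)
        loglik += math.log(mu + k * omega * A)
        # Integral of this event's kernel from t to T.
        loglik -= k * (1.0 - math.exp(-omega * (T - t)))
        prev = t
    return loglik


# With k = 0 the model reduces to a Poisson process: n * log(mu) - mu * T.
print(hawkes_loglik([1.0, 2.0, 3.0], 10.0, 0.5, 0.0, 1.0))

# Event split implied by the reported fit: each event triggers k offspring on average.
n_events, k = 1032, 0.585
excited = k * n_events           # about 604
background = n_events - excited  # about 428
print(round(excited), round(background))
```

The split uses the interpretation of k as the mean number of events triggered by each event, ignoring edge effects at the end of the window.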

We can also fit non-parametrically using EM (as in Mohler et al., 2011 JASA)

Here

$$\lambda(t) = \mu(t) + \sum_{t_i < t} g(t - t_i).$$


Figure 8: Here we show KDEs of the background $\mu(t)$ and excited kernel $g(t)$ for
pair (9, 18). These kernels give roughly 202 background events, and 1,100 excited
events. Log-likelihood is -1379.4, though, which is worse than Hawkes.
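
In EM-type fits of this kind, the expectation step usually assigns each event a probability of being a background event, μ(t_i)/λ(t_i), or of having been triggered by an earlier event j, g(t_i - t_j)/λ(t_i). The sketch below shows only that step, for current estimates of μ and g passed in as plain functions. It is an illustration under these assumptions, not the procedure from the cited paper, and the kernel-density re-estimation step is omitted.

```python
import math


def e_step(times, mu, g):
    """For sorted event times, return (p_background, p_triggered), where
    p_background[i] = mu(t_i) / lambda(t_i) and
    p_triggered[i][j] = g(t_i - t_j) / lambda(t_i) for each earlier event j."""
    p_background = []
    p_triggered = []
    for i, t in enumerate(times):
        bg = mu(t)
        trig = [g(t - times[j]) for j in range(i)]
        lam = bg + sum(trig)  # intensity at t_i under the current estimates
        p_background.append(bg / lam)
        p_triggered.append([x / lam for x in trig])
    return p_background, p_triggered


# Toy estimates (hypothetical): constant background, exponential kernel.
p_bg, p_tr = e_step([0.0, 1.0], mu=lambda t: 1.0, g=lambda dt: math.exp(-dt))
print(p_bg)  # the first event has no parents, so it is background with probability 1
print(p_tr)
```

Summing `p_background` gives an expected background count under the current estimates, and summing all entries of `p_triggered` gives an expected triggered count.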

What might we explore next?


► Do a more careful EM analysis (MPLE?).

► Build daily and weekly rhythms directly into the EM or Hawkes
process, since a lot of information is there.

► Explore data from other pairs (i, j), and perhaps include multi-party
interactions.

► Look (much) more deeply into the graph structure, perhaps using
some of Uminsky's coalition-finding techniques?

► Try to obtain more data from different sources (Gmail?) on frequency
of e-mails sent per day, to explore perhaps a simple model that
explains similarities (and differences) between IkeNet and Wolfram.

► All this, and much, much more...